Abstract

Living organisms are the most complex, interesting and significant objects among
all substructures of the universe. Life science is regarded as a science of the 21st century,
and one can expect great new discoveries in the near future. This article contains a
brief introductory review of genetic information, its coding, and the translation of genes to
proteins through the genetic code. Some theoretical approaches to the modelling of the
genetic code are presented. In particular, the connection of the genetic code with number
theory is considered and the role of p-adic numbers is underlined.

1 Introduction 

Francis Crick (1916-2004), who together with James Watson (1928-) discovered the
double-helical structure of DNA, announced in 1953 "We have discovered the secret of life" [1].
However, if it was a secret of life, then life still has many secrets. One of them is the genetic
code. Although the genetic code was finally experimentally deciphered in 1966, its theoretical
understanding has remained unsatisfactory, and new models have been offered from time to time.
The genetic code is still a subject of more or less intensive investigation from mathematical,
physical, chemical, biological and bioinformatic points of view.

It is worth recalling the emergence of the special theory of relativity and quantum mechanics.
Both appeared as a result of unsatisfactory attempts to extend classical theory to
new physical phenomena, the invention of appropriate new physical concepts, and the use of
suitable new mathematical methods. Although far from everyday experience, these two theories
describe physical reality quite successfully. We believe that a similar situation should arise
in the theoretical description of living processes in biological organisms. To this end, ultrametric
and p-adic methods seem to be very promising tools for further investigation of life.

Here we want to emphasize the role of ultrametric distance, and in particular the p-adic one.
Namely, some parts of a biological system can be considered simultaneously with respect to
different metrics: the usual Euclidean metric, which measures spatial distances, and some
other metrics, which measure nearness related to some bioinformation (or other) properties.

The general notion of a metric space $(M, d)$ was introduced in 1906 by Maurice Fréchet
(1878-1973), where $M$ is a set and $d$ is a distance function. The distance $d$ is a real-valued
function of any two elements $x, y \in M$ which must satisfy the following three properties:

$$\text{(i)}\quad d(x, y) = 0 \Leftrightarrow x = y, \tag{1}$$

$$\text{(ii)}\quad d(x, y) = d(y, x), \tag{2}$$

$$\text{(iii)}\quad d(x, y) \le d(x, z) + d(z, y), \tag{3}$$

where the last property is called the triangle inequality. An ultrametric space is a metric space
which satisfies the strong triangle inequality, i.e.

$$d(x, y) \le \max\{d(x, z),\, d(z, y)\}. \tag{4}$$

The word ultrametric was introduced in 1944 by Marc Krasner (1912-1985), although examples
of ultrametric spaces had been known earlier under different names. An important class of
ultrametric spaces contains the fields of p-adic numbers, which were introduced in 1897 by Kurt
Hensel (1861-1941). Taxonomy, which was started in 1735 by Carl Linné (1707-1778) as biological
classification with hierarchical structure, is another significant example of ultrametricity [2].

In this article we consider some aspects of the genetic code using an ultrametric space
whose elements are codons represented by some natural numbers, with the p-adic distance
between them. However, to have a self-contained and comprehensible exposition of
the genetic code and its connection with number theory, we shall first briefly review some
basic notions from molecular biology.

2 Some Notions of Molecular Biology 

One of the essential characteristics that differentiate a living organism from all other
material systems is related to its genome. The genome of an organism is its entire hereditary
information encoded in deoxyribonucleic acid (DNA), and contains both genes and
non-coding sequences. In some viruses the genetic material is encoded in ribonucleic acid (RNA).
Investigation of the entire genome is the subject of genomics.

DNA is a macromolecule composed of two polynucleotide chains with a double-helical
structure. Nucleotides consist of a base, a sugar and a phosphate group. The helical backbone
is formed by the sugar and phosphate groups. There are four bases, and they are the building
blocks of the genetic information. They are called adenine (A), guanine (G), cytosine (C)
and thymine (T). Adenine and guanine are derived from purine, while cytosine and thymine
are derived from pyrimidine. In the sense of information, the nucleotide and its base represent
the same object. Nucleotides are arranged along the chains of the double helix through base pairs
A-T and C-G, bonded by 2 and 3 hydrogen bonds, respectively. As a consequence of this pairing
there is an equal number of cytosines and guanines, as well as an equal number of adenines and
thymines. DNA is packaged in chromosomes, which in eukaryotic cells are localized in the nucleus.

The main role of DNA is to store genetic information, and there are two main processes that
exploit this information. The first one is replication, in which DNA duplicates, giving two new
DNA molecules containing the same information as the original one. This is possible owing to the
fact that each of the two chains contains the complementary bases of the other one. The second
process is related to gene expression, i.e. the passage of DNA gene information to proteins. It is
performed by messenger ribonucleic acid (mRNA), which is usually a single polynucleotide
chain. The mRNA is synthesized during the first part of this process, known as transcription,
when the nucleotides C, A, T, G from DNA are respectively transcribed into their complements
G, U, A, C in mRNA, where T is replaced by U (U is uracil, which is a pyrimidine).
The next step in gene expression is translation, when the information coded by codons in the
mRNA is translated into proteins. Transfer RNA (tRNA) and ribosomal RNA (rRNA) also
participate in this process.
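
The transcription rule stated above can be illustrated by a minimal Python sketch. The names `DNA_TO_MRNA` and `transcribe` are illustrative assumptions, and the sketch ignores strand orientation and all biochemical detail; it only applies the base-by-base complement map C, A, T, G to G, U, A, C.

```python
# Complement map from a DNA template base to the mRNA base, as stated in the text.
DNA_TO_MRNA = {"C": "G", "A": "U", "T": "A", "G": "C"}


def transcribe(template: str) -> str:
    """Return the mRNA complement of a DNA template strand, base by base."""
    return "".join(DNA_TO_MRNA[base] for base in template)


if __name__ == "__main__":
    print(transcribe("CATG"))
```

Calling `transcribe` on a string containing any character other than C, A, T or G raises a `KeyError`, which is a deliberate choice for this sketch rather than silent skipping.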

Codons are ordered trinucleotides composed of C, A, U (T) and G. Each of them carries
information which controls the use of one of the 20 standard amino acids, or a stop signal,
in the synthesis of proteins.

Protein synthesis in all eukaryotic cells is performed in the ribosomes of the cytoplasm. 
Proteins [3] are organic macromolecules composed of amino acids arranged in a linear chain. 
Amino acids [4] are molecules that consist of amino, carboxyl and R (side chain) groups. 
Depending on R group there are 20 standard amino acids. These amino acids are joined 
together by a peptide bond. Proteins are substantial ingredients of all living organisms 
participating in various processes in cells and determining the phenotype of an organism. 
The study of proteins, especially their structure and functions, is called proteomics. The 
proteome is the entire set of proteins in an organism. 

The human genome, which represents all genetic information of Homo sapiens, is
composed of about $3 \cdot 10^9$ DNA base pairs and contains about $3 \cdot 10^4$ genes [5]. In the
human body there may be about 2 million different proteins. The sequence of amino acids in a
protein is determined by the sequence of codons contained in the corresponding DNA gene. After
transcription of a gene from DNA to mRNA there is a maturation of the primary sequence
of codons to the final one, which determines the primary structure of the corresponding protein.
Thus not only DNA but also RNA plays an important role in gene expression. For more
detailed and comprehensive information on molecular biology and the genetic code one can
refer to [5, 6].

3 Genetic Code 

The relation between codons and amino acids is known as the genetic code [7]. From a
mathematical point of view, the genetic code is a map from the set of 64 codons to the set of 20
amino acids and one stop signal.

So far there are about 20 known versions of the genetic code (see, e.g. [8]), but the two most
important are the standard code and the vertebrate mitochondrial code.

In the sequel we shall mainly have in mind the vertebrate mitochondrial code, because
it is a simple one and the others may be regarded as its slight modifications. There are
$4 \times 4 \times 4 = 64$ codons. In the vertebrate mitochondrial code, 60 codons are related to the 20
different amino acids and 4 stop codons act as termination signals. According to experimental
observations, two amino acids are coded by six codons, six amino acids by four codons, and
twelve amino acids by two codons. The property that some amino acids are coded by more
than one codon is known as genetic code degeneracy. This degeneracy is a very important
property of the genetic code and gives an efficient way to minimize errors caused by mutations
and translation.

There are in principle up to $21^{64}$ possible mappings from 64 codons to 20 amino acids
and one stop signal. It is obvious that some of them cannot play the role of the genetic code. Since
there is still a huge number of possibilities for genetic codes and only a very small number
of them is represented in living cells, it has been a persistent theoretical challenge to find an
appropriate approach explaining the about 20 contemporary genetic codes.
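
The counts quoted in this section can be checked with a few lines of Python. This is only an arithmetic sketch; the variable names are illustrative, and the multiplicities are exactly those stated in the text for the mitochondrial code.

```python
# Codons per amino acid -> number of amino acids with that multiplicity
# (two amino acids with six codons, six with four, twelve with two).
multiplicities = {6: 2, 4: 6, 2: 12}

amino_acids = sum(multiplicities.values())
sense_codons = sum(k * n for k, n in multiplicities.items())
stop_codons = 4
total_codons = sense_codons + stop_codons

# Number of all maps from 64 codons to 21 meanings (20 amino acids + stop).
n_maps = 21 ** 64

if __name__ == "__main__":
    print(amino_acids, sense_codons, total_codons)
    print(f"21^64 has {len(str(n_maps))} decimal digits")
```

The multiplicities add up to 20 amino acids and 60 sense codons, which together with the 4 stop codons gives all 64 codons, while $21^{64}$ is a number with 85 decimal digits.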

The first model of the genetic code was proposed in 1954 by the physicist George Gamow
(1904-1968), who called it the diamond code. In his model codons are composed of three
nucleotides and proteins are synthesized directly on DNA: each cavity on DNA attracts one of
20 amino acids. This is an overlapping code, and it was ruled out by an analysis of correlations
between amino acids in proteins. The next model of the genetic code was proposed in 1957 by
Crick, and is known as the comma-free code. This model was so elegant that it was almost
universally accepted. However, an experiment in 1961 demonstrated that the UUU codon codes the
amino acid phenylalanine, while in the comma-free code it codes nothing. Gamow's and Crick's
models are very pretty but wrong: the living world prefers the actual codes, which are more stable
with respect to possible errors (for a popular review of the early models, see [1]).

An intensive study of the connection between the ordering of nucleotides in DNA (and RNA)
and the ordering of amino acids in proteins led to the experimental deciphering of the standard
genetic code in the mid-1960s. The genetic code is understood as a dictionary for the translation
of information from DNA (through RNA) to the synthesis of proteins from amino acids. The
information on amino acids is contained in codons: each codon codes either an amino acid
or a termination signal (see, e.g. the table of the vertebrate mitochondrial code). To a sequence
of codons in RNA there corresponds a quite definite sequence of amino acids in a protein, and this
sequence of amino acids determines the primary structure of the protein.

At the time of deciphering, it was mainly believed that the standard code is unique, the
result of chance, and fixed a long time ago. Crick [9] expressed this belief in his "frozen
accident" hypothesis, which has not been supported by later observations. Moreover, so far
at least 20 different codes have been discovered and some general regularities found. At first
glance the genetic code looks rather arbitrary, but it is not. Namely, mutations between
synonymous codons give the same amino acid. When a mutation alters an amino acid, it
is like a substitution of the original by a similar one. In this respect the code is almost optimal.

Despite remarkable experimental successes, there is no simple and generally accepted
theoretical understanding of the genetic code. There are many papers in this direction,
scattered in various journals, with theoretical approaches based more or less on chemical,
physical, biological and mathematical aspects of the genetic code. However, the foundation of
biological coding is still an open problem. In particular, it is not clear why the genetic code exists
in just a few known ways and not in many other possible ones. What is the principle (or principles)
employed in the establishment of a basic (mitochondrial) code? What are the properties of codons
that connect them into definite multiplets which code the same amino acid or termination
signal?

Let us mention some models of the genetic code proposed after the deciphering of the standard
code. In 1966 the physicist Yuri Rumer (1901-1985) emphasized the role of the first two nucleotides
in the codons [10]. There are models which are based on chemical properties of amino acids
(see, e.g. [11]). In some models, connections between the number of constituents of amino acids
and nucleotides and some properties of natural numbers are investigated (see [12, 13] and
references therein). A model based on the quantum algebra $U_q(sl(2) \oplus sl(2))$ in the $q \to 0$ limit
was proposed as a symmetry algebra for the genetic code (see [14] and references therein). In a
sense this approach mimics the quark model of baryons. Besides some successes of this approach,
there is a problem with its rather many parameters. There are also papers (see, e.g. [15], [16]
and [17]) starting with a 64-dimensional irreducible representation of a Lie (super)algebra and
trying to connect the multiplicity of codons with irreducible representations of subalgebras arising
in a chain of symmetry breaking. Although interesting as an attempt to describe the evolution
of the genetic code, these Lie algebra approaches did not progress further. For a very brief
review of these and some other theoretical approaches to the genetic code one can see [14].
There is still no generally accepted explanation of the genetic code.



4 Some Mathematical Preliminaries and p-Adic Codon Space 

As a new tool to study Diophantine equations, p-adic numbers were introduced by the
German mathematician Kurt Hensel in 1897. They are involved in many branches of modern
mathematics. An elementary introduction to p-adic numbers can be found in the book [18].
However, for our purposes we will use here only a small portion of p-adics, mainly some finite
sets of integers and the ultrametric distances between them.
Let us introduce the set of natural numbers

$$C_5[64] = \{\, n_0 + n_1\, 5 + n_2\, 5^2 \;:\; n_i = 1, 2, 3, 4 \,\}, \tag{5}$$

where $n_i$ are digits related to nucleotides by the following assignments: C (cytosine) = 1, A
(adenine) = 2, T (thymine) = U (uracil) = 3, G (guanine) = 4. This is a finite expansion to
the base 5. It is obvious that 5 is a prime number and that the set $C_5[64]$ contains 64 numbers
between 31 and 124 in the usual base 10. In the sequel we shall often denote elements of
$C_5[64]$ by their digits to the base 5 in the following way: $n_0 + n_1\, 5 + n_2\, 5^2 \equiv n_0 n_1 n_2$. Note
that here the ordering of digits is the same as in the expansion, i.e. this ordering is opposite to the
usual one. There is now an evident one-to-one correspondence between codons in three-letter
notation and their $n_0 n_1 n_2$ number representation.

There is no summation, subtraction, multiplication or division on the codon space. A
mapping of codons to codons is possible by replacement of one nucleotide by another. In other
words, it makes sense to interchange digits on the space $C_5[64]$, but not to perform the standard
arithmetic operations (summation, subtraction, multiplication and division).

It is also often important to know the distance between numbers. Distance can be defined
by a norm. On the set $\mathbb{Z}$ of integers there are two kinds of nontrivial norm: the usual absolute
value $|\cdot|_\infty$ and the p-adic absolute value $|\cdot|_p$, where $p$ is any prime number. The usual
absolute value is well known from elementary mathematics, and the corresponding distance between
two numbers $x$ and $y$ is $d_\infty(x, y) = |x - y|_\infty$.

The p-adic absolute value is related to the divisibility of integers by the prime number $p$.
The difference of two integers is again an integer. The p-adic distance between two integers can be
understood as a measure of the divisibility by $p$ of their difference (the more divisible, the shorter).
By definition, the p-adic norm of an integer $m \in \mathbb{Z}$ is $|m|_p = p^{-k}$, where $k \in \mathbb{N} \cup \{0\}$ is the
degree of divisibility of $m$ by the prime $p$ (i.e. $m = p^k m'$, where $m'$ is not divisible by $p$), and
$|0|_p = 0$. Here $\mathbb{N}$ and $\mathbb{Z}$ are the set of natural numbers and the set of integers, respectively.
This norm is a mapping from $\mathbb{Z}$ into the non-negative rational numbers and has the following
properties:

(i) $|x|_p \ge 0$, and $|x|_p = 0$ if and only if $x = 0$,

(ii) $|x y|_p = |x|_p\, |y|_p$,

(iii) $|x + y|_p \le \max\{|x|_p, |y|_p\} \le |x|_p + |y|_p$ for all $x, y \in \mathbb{Z}$.

Because of the strong triangle inequality $|x + y|_p \le \max\{|x|_p, |y|_p\}$, the p-adic absolute value
is a non-Archimedean (ultrametric) norm. One can easily conclude that $0 \le |m|_p \le 1$
for any $m \in \mathbb{Z}$ and any prime $p$.

The p-adic distance between two integers $x$ and $y$ is

$$d_p(x, y) = |x - y|_p. \tag{6}$$
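
The definitions of the p-adic norm and distance on the integers translate directly into code. The following Python sketch uses exact rational arithmetic; the function names `padic_valuation`, `padic_norm` and `padic_distance` are illustrative choices, not taken from the cited literature.

```python
from fractions import Fraction


def padic_valuation(m: int, p: int) -> int:
    """Largest k such that p**k divides the nonzero integer m."""
    if m == 0:
        raise ValueError("the valuation of 0 is infinite")
    m = abs(m)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return k


def padic_norm(m: int, p: int) -> Fraction:
    """p-adic norm |m|_p = p**(-k), with |0|_p = 0."""
    if m == 0:
        return Fraction(0)
    return Fraction(1, p ** padic_valuation(m, p))


def padic_distance(x: int, y: int, p: int) -> Fraction:
    """p-adic distance d_p(x, y) = |x - y|_p."""
    return padic_norm(x - y, p)


if __name__ == "__main__":
    print(padic_norm(75, 5))           # 75 = 3 * 5**2
    print(padic_distance(33, 47, 2))   # difference 14 = 2 * 7
```

Using `Fraction` keeps values such as 1/25 exact, so comparisons between distances do not suffer from floating-point rounding.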



Since the p-adic absolute value is ultrametric, the p-adic distance (6) is also ultrametric, i.e. it
satisfies

$$d_p(x, y) \le \max\{d_p(x, z),\, d_p(z, y)\} \le d_p(x, z) + d_p(z, y), \tag{7}$$

where $x$, $y$ and $z$ are any three integers.

The set $C_5[64]$ introduced above, endowed with the p-adic distance, we shall call the p-adic codon
space, i.e. elements of $C_5[64]$ are codons denoted by $n_0 n_1 n_2$. The 5-adic distance between two
codons $a, b \in C_5[64]$ is

$$d_5(a, b) = |a_0 + a_1\, 5 + a_2\, 5^2 - b_0 - b_1\, 5 - b_2\, 5^2|_5, \tag{8}$$

where $a_i, b_i \in \{1, 2, 3, 4\}$. When $a \ne b$, then $d_5(a, b)$ may have three different values:

• $d_5(a, b) = 1$ if $a_0 \ne b_0$,

• $d_5(a, b) = 1/5$ if $a_0 = b_0$ and $a_1 \ne b_1$,

• $d_5(a, b) = 1/5^2$ if $a_0 = b_0$, $a_1 = b_1$ and $a_2 \ne b_2$.

We see that the largest 5-adic distance between codons is 1, and it is the maximum p-adic
distance on $\mathbb{Z}$. The smallest 5-adic distance on the codon space is $5^{-2}$.

If we apply the real (standard) distance $d_\infty(a, b) = |a_0 + a_1\, 5 + a_2\, 5^2 - b_0 - b_1\, 5 - b_2\, 5^2|_\infty$, then
the third nucleotides $a_2$ and $b_2$ would play a more important role than those at the second position
(i.e. $a_1$ and $b_1$), and the nucleotides $a_0$ and $b_0$ would be of the smallest importance. In the real
$C_5[64]$ space distances are also discrete, but take values $1, 2, \ldots, 93$. The smallest real and the
largest 5-adic distance are equal to 1. While the real distance describes spatial separation, the p-adic
one serves to describe information nearness on the codon space.
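
The codon space and its two distances can be explored with a short Python sketch. The names `DIGIT`, `codon_to_int`, `d5` and `d_real` are illustrative assumptions; the digit assignment and the expansion $n_0 + n_1\, 5 + n_2\, 5^2$ are those given in the text.

```python
from fractions import Fraction
from itertools import product

# Digit assignment from the text: C = 1, A = 2, U (T) = 3, G = 4.
DIGIT = {"C": 1, "A": 2, "U": 3, "T": 3, "G": 4}


def codon_to_int(codon: str) -> int:
    """Map a codon n0 n1 n2 to the integer n0 + n1*5 + n2*5**2."""
    n0, n1, n2 = (DIGIT[base] for base in codon)
    return n0 + 5 * n1 + 25 * n2


def d5(a: str, b: str) -> Fraction:
    """5-adic distance between two codons."""
    diff = abs(codon_to_int(a) - codon_to_int(b))
    if diff == 0:
        return Fraction(0)
    k = 0
    while diff % 5 == 0:
        diff //= 5
        k += 1
    return Fraction(1, 5 ** k)


def d_real(a: str, b: str) -> int:
    """Usual (real) distance between two codons."""
    return abs(codon_to_int(a) - codon_to_int(b))


CODONS = ["".join(c) for c in product("CAUG", repeat=3)]
values = sorted(codon_to_int(c) for c in CODONS)
pairs = [(a, b) for a in CODONS for b in CODONS if a != b]
five_adic_values = {d5(a, b) for a, b in pairs}
real_values = {d_real(a, b) for a, b in pairs}

if __name__ == "__main__":
    print(values[0], values[-1], len(set(values)))
    print(sorted(five_adic_values), min(real_values), max(real_values))
```

For instance, CCC and CCA differ only in the third nucleotide: their real distance is 25 while their 5-adic distance is 1/25. CCC and ACC differ in the first nucleotide: their real distance is 1 and their 5-adic distance is also 1, the largest possible.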

It is worth emphasizing that the metric role of digits depends on their position in number 
expansion and it is quite opposite in real and p-adic cases. We shall see later that the first 
two nucleotides in a codon are more important than the third one and that p-adic distance 
between codons is a natural one in description of their information content (the nearer, the 
more similar meaning). 



5 p-Adic Genetic Code 

Modelling of the genetic code, the genome and proteins is a challenge as well as an opportunity
for the application of p-adic distances. Recently [19, 20, 21], a p-adic approach to DNA and RNA
sequences, the genome and the genetic code was introduced and considered. The central
point of this approach is an appropriate identification of the four nucleotides with the digits 1, 2, 3, 4
of the 5-adic representation of some positive integers, and the application of p-adic distances between
the obtained numbers. 5-Adic numbers with three digits form 64 integers which correspond to the
64 codons. It is inappropriate to use the digit 0 for a nucleotide because it leads to
non-uniqueness in the representation of codons by natural numbers. For example, 123 = 123000
as numbers, but 123 would represent one codon and 123000 two codons. This is also the reason why
we do not use a 4-adic representation for codons, since it would contain a nucleotide represented
by the digit 0. One can use 0 as a digit to denote the absence of any nucleotide. One of the main
results that we have obtained is an explanation of the structure of the genetic code degeneracy
using the p-adic distance between codons. A similar approach to the genetic code was later

TABLE I. Table of the vertebrate mitochondrial code in the 5-adic and three-letter notation.

111 CCC Pro    211 ACC Thr    311 UCC Ser    411 GCC Ala
112 CCA Pro    212 ACA Thr    312 UCA Ser    412 GCA Ala
113 CCU Pro    213 ACU Thr    313 UCU Ser    413 GCU Ala
114 CCG Pro    214 ACG Thr    314 UCG Ser    414 GCG Ala

121 CAC His    221 AAC Asn    321 UAC Tyr    421 GAC Asp
122 CAA Gln    222 AAA Lys    322 UAA Ter    422 GAA Glu
123 CAU His    223 AAU Asn    323 UAU Tyr    423 GAU Asp
124 CAG Gln    224 AAG Lys    324 UAG Ter    424 GAG Glu

131 CUC Leu    231 AUC Ile    331 UUC Phe    431 GUC Val
132 CUA Leu    232 AUA Met    332 UUA Leu    432 GUA Val
133 CUU Leu    233 AUU Ile    333 UUU Phe    433 GUU Val
134 CUG Leu    234 AUG Met    334 UUG Leu    434 GUG Val

141 CGC Arg    241 AGC Ser    341 UGC Cys    441 GGC Gly
142 CGA Arg    242 AGA Ter    342 UGA Trp    442 GGA Gly
143 CGU Arg    243 AGU Ser    343 UGU Cys    443 GGU Gly
144 CGG Arg    244 AGG Ter    344 UGG Trp    444 GGG Gly



considered on the dyadic plane [22], and recently [23] the 2-adic distance was applied to the PAM
matrix in bioinformatics.

Let us mention that p-adic models in mathematical physics have been actively considered
since 1987 (see [24], [25] for early reviews and [26, 27, 28] for some recent reviews). It is worth
noting that p-adic models with pseudodifferential operators have been successfully applied to
the interbasin kinetics of proteins [29]. Some p-adic aspects of cognitive, psychological and social
phenomena have also been considered [30].

Let us now turn to Table I. We observe that this table can be regarded as a big
rectangle divided into 16 equal smaller rectangles: 8 of them are quadruplets which correspond
one-to-one to 8 amino acids, and the other 8 rectangles are divided into 16 doublets coding 14
amino acids and the termination (stop) signal (by two doublets at different places). There is a
manifest symmetry in the distribution of these quadruplets and doublets. Namely, quadruplets
and doublets separately form two figures, which are symmetric with respect to the mid vertical
line (a left-right symmetry), i.e. they are invariant under the interchange C ↔ G (1 ↔ 4) and
A ↔ U (2 ↔ 3) at the first position in codons along all horizontal lines. In other words, along each
horizontal line one can perform doublet ↔ doublet and quadruplet ↔ quadruplet interchange
around the vertical midline. Recall that DNA is also symmetric under the simultaneous interchange
of complementary nucleotides C ↔ G and A ↔ T between its strands. All doublets in this
table form a nice figure which looks like the letter T.
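
The left-right symmetry described above can be checked mechanically. The Python sketch below encodes Table I as a 64-letter string of one-letter amino acid codes (with `*` for Ter), ordered with the first nucleotide varying slowest over C, A, U, G. This encoding and the names `MITO`, `SWAP` and `is_quadruplet` are assumptions made for illustration.

```python
from itertools import product

BASES = "CAUG"  # digits 1, 2, 3, 4

# Table I in one-letter codes, '*' = Ter; 16 letters per first nucleotide.
AA = (
    "PPPPHQHQLLLLRRRR"   # first nucleotide C
    "TTTTNKNKIMIMS*S*"   # first nucleotide A
    "SSSSY*Y*FLFLCWCW"   # first nucleotide U
    "AAAADEDEVVVVGGGG"   # first nucleotide G
)
MITO = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AA))

# Interchange C <-> G and A <-> U at the first codon position.
SWAP = {"C": "G", "G": "C", "A": "U", "U": "A"}


def is_quadruplet(first: str, second: str) -> bool:
    """True if all four codons with the given first two bases share one meaning."""
    return len({MITO[first + second + third] for third in BASES}) == 1


quadruplets = {(f, s) for f in BASES for s in BASES if is_quadruplet(f, s)}
symmetric = all(
    ((SWAP[f], s) in quadruplets) == ((f, s) in quadruplets)
    for f in BASES
    for s in BASES
)

if __name__ == "__main__":
    print(len(quadruplets), symmetric)
```

The check finds 8 quadruplet rectangles and confirms that the partition into quadruplet and doublet rectangles is unchanged by the interchange at the first position.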

It is worth noting that the above invariance also leaves unchanged the polarity and
hydrophobicity of the corresponding amino acids in all but three cases: Asn ↔ Tyr, Arg ↔ Gly, and
Ser ↔ Cys.

5.1 Degeneracy of the genetic code

Let us now explore distances between codons and their role in the formation of the genetic code
degeneracy.

To this end let us again turn to Table I as a representation of the $C_5[64]$ codon space.
Namely, we observe that there are 16 quadruplets such that each of them has the same first
two digits. Hence the 5-adic distance between any two different codons within a quadruplet is

$$d_5(a, b) = |a_0 + a_1\, 5 + a_2\, 5^2 - a_0 - a_1\, 5 - b_2\, 5^2|_5 = |(a_2 - b_2)\, 5^2|_5 = |a_2 - b_2|_5\, |5^2|_5 = 5^{-2}, \tag{9}$$

because $a_0 = b_0$, $a_1 = b_1$ and $|a_2 - b_2|_5 = 1$. According to (9), codons within every quadruplet
are at the smallest distance, i.e. they are nearest compared to all other codons.

Since codons are composed of three arranged nucleotides, each of which is either a purine
or a pyrimidine, it is natural to try to quantify nearness inside purines and pyrimidines, as
well as the distance between elements from these two groups of nucleotides. Fortunately there is
a tool, which is again related to the p-adics, and now it is the 2-adic distance. One can easily see
that the 2-adic distance between the pyrimidines C and U is $d_2(1, 3) = |3 - 1|_2 = 1/2$, the same as the
distance between the purines A and G, namely $d_2(2, 4) = |4 - 2|_2 = 1/2$. However, the 2-adic distance
between C and A or G, as well as the distance between U and A or G, is 1 (i.e. the maximum).

With respect to the 2-adic distance, the above quadruplets may be regarded as composed of
two doublets: $a = a_0 a_1 1$ and $b = a_0 a_1 3$ make the first doublet, and $c = a_0 a_1 2$ and $d = a_0 a_1 4$
form the second one. The 2-adic distance between codons within each of these doublets is 1/2, i.e.

$$d_2(a, b) = |(3 - 1)\, 5^2|_2 = \frac{1}{2}, \qquad d_2(c, d) = |(4 - 2)\, 5^2|_2 = \frac{1}{2}, \tag{10}$$

because $3 - 1 = 4 - 2 = 2$.
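
The 2-adic distances between nucleotides quoted above can be tabulated with a small Python sketch. The helper names `norm2` and `d2_bases` are illustrative assumptions; the digits are those assigned in the text.

```python
from fractions import Fraction

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}
PYRIMIDINES = "CU"
PURINES = "AG"


def norm2(m: int) -> Fraction:
    """2-adic norm of an integer, with |0|_2 = 0."""
    if m == 0:
        return Fraction(0)
    m = abs(m)
    k = 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return Fraction(1, 2 ** k)


def d2_bases(x: str, y: str) -> Fraction:
    """2-adic distance between the digits of two nucleotides."""
    return norm2(DIGIT[x] - DIGIT[y])


if __name__ == "__main__":
    for x in "CAUG":
        print(x, [str(d2_bases(x, y)) for y in "CAUG"])
```

Within the pyrimidines and within the purines the distance is 1/2, while every purine-pyrimidine pair is at distance 1. Multiplying the digit difference by $5^2$, as in (10), does not change the 2-adic norm because 5 is odd.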

One can now look at Table I as a system of 32 doublets. Thus the 64 codons are clustered in
a very regular way into 32 doublets. Each of the 21 subjects (20 amino acids and 1 termination
signal) is coded by one, two or three doublets. In fact, there are two, six and twelve amino acids
coded by three, two and one doublet, respectively. The remaining two doublets code the termination
signal.

Note that 2 of 16 doublets code 2 amino acids (Ser and Leu) which are already coded by 
2 quadruplets, thus amino acids Serine and Leucine are coded by 6 codons (3 doublets). 

To have a more complete picture of the genetic code it is useful to consider possible
distances between codons of different quadruplets as well as between different doublets. Also,
we introduce a distance between quadruplets or between doublets, especially when distances
between their codons have the same value. Thus the 5-adic distance between any two quadruplets
in the same column is 1/5, while such a distance between other quadruplets is 1. The 5-adic distance
between doublets coincides with the 5-adic distance between quadruplets, and this distance is
$1/5^2$ when the doublets are within the same quadruplet.

The 2-adic distances between codons, doublets and quadruplets are more complex. There
are three basic cases:

• codons differ only in one digit,

• codons differ in two digits,

• codons differ in all three digits.

In the first case, the 2-adic distance can be 1/2 or 1, depending on whether the difference between
the digits is 2 or not, respectively.

Let us now look at the 2-adic distances between doublets coding leucine and also between
doublets coding serine. These are the two cases of amino acids coded by three doublets. One has
the following distances:

• $d_2(332, 132) = d_2(334, 134) = 1/2$ for leucine,

• $d_2(311, 241) = d_2(313, 243) = 1/2$ for serine.

If we used the usual distance between codons instead of the p-adic one, we would observe
that two synonymous codons are very far apart, and that those which are close code different amino
acids. Thus we conclude that not the usual metric but an ultrametric is inherent to codons.

How is the degeneracy of the genetic code related to p-adic distances between codons? The
answer is in the following p-adic degeneracy principle: Two codons have the same meaning
with respect to amino acids if they are at the smallest 5-adic distance and at 2-adic distance 1/2.
Here the p-adic distance plays the role of similarity: the closer, the more similar. Taking into
account all known codes (see the next subsection), there is a slight violation of this principle.
It is worth noting that in modern particle physics it is just the broken fundamental gauge symmetry
that gives its standard model. It makes sense to introduce a new principle (let us call it the reality
principle): Reality is a realization of some broken fundamental principles. It seems that this
principle is valid not only in physics but also in all sciences. In this context the modern genetic
code is an evolutionarily broken form of the above p-adic degeneracy principle.
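
The p-adic degeneracy principle can be tested against Table I by enumerating all codon pairs at 5-adic distance 1/25 and 2-adic distance 1/2. The Python sketch below is an illustration under the assumption that Table I is encoded as a 64-letter string of one-letter amino acid codes (`*` for Ter); the names `MITO`, `to_int`, `norm` and `doublets` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product

BASES = "CAUG"
DIGIT = {b: i for i, b in enumerate(BASES, start=1)}
AA = (
    "PPPPHQHQLLLLRRRR"
    "TTTTNKNKIMIMS*S*"
    "SSSSY*Y*FLFLCWCW"
    "AAAADEDEVVVVGGGG"
)
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
MITO = dict(zip(CODONS, AA))  # Table I, first nucleotide varying slowest


def to_int(codon: str) -> int:
    """Codon n0 n1 n2 -> n0 + n1*5 + n2*25."""
    n0, n1, n2 = (DIGIT[b] for b in codon)
    return n0 + 5 * n1 + 25 * n2


def norm(m: int, p: int) -> Fraction:
    """p-adic norm of an integer, with |0|_p = 0."""
    if m == 0:
        return Fraction(0)
    m, k = abs(m), 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)


def dist(a: str, b: str, p: int) -> Fraction:
    return norm(to_int(a) - to_int(b), p)


# Pairs at the smallest 5-adic distance and at 2-adic distance 1/2.
doublets = [
    (a, b)
    for a, b in combinations(CODONS, 2)
    if dist(a, b, 5) == Fraction(1, 25) and dist(a, b, 2) == Fraction(1, 2)
]
principle_holds = all(MITO[a] == MITO[b] for a, b in doublets)
doublets_per_meaning = Counter(MITO[a] for a, _ in doublets)

if __name__ == "__main__":
    print(len(doublets), principle_holds)
    print(doublets_per_meaning)
```

Under this encoding the enumeration yields 32 doublets, each coding a single meaning, with leucine and serine coded by three doublets each and the termination signal by two.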

5.2 Evolution of the genetic code 

The origin and early evolution of the genetic code are among the most interesting and
important questions related to the origin and evolution of life. However, since there are
no fossils of organisms from that very early period of life, this gives rise to many speculations.
Nevertheless, one can hope that some of the hypotheses may be tested by looking for their traces
in contemporary genomes.

It seems natural to consider biological evolution as an adaptive development of simpler 
living systems to more complex ones. Namely, living organisms are open systems in permanent 
interaction with environment. Thus the evolution can be modelled by a system with given 
initial conditions and guided by some internal rules taking into account environmental factors. 

We are going now to conjecture on the evolution of the genetic code using our p-adic 
approach to the genomic space, and assuming that preceding codes used simpler codons and 
older amino acids. 

Recall that the p-adic codon space $C_p[(p - 1)^m]$ has two parameters: $p$, related to the $p - 1$
building blocks, and $m$, the multiplicity of the building blocks (nucleotides) in the space elements
(codons).

• The case $C_2[1]$ is a trivial one and useless for a primitive code.

• The case $C_3[2^m]$ with $m = 1, 2, 3$ does not seem to be realistic.

• The case $C_5[4^m]$ with $m = 1, 2, 3$ offers a possible pattern to consider for the evolution of the
genetic code. Namely, the codon space could evolve in the following way:
$C_5[4] \to C_5[4^2] \to C_5[4^3] = C_5[64]$.

TABLE II. Temporal appearance of the 20 standard amino acids [31].

(1) Glycine, G          (2) Alanine, A        (3) Aspartate, D       (4) Valine, V
(5) Proline, P          (6) Serine, S         (7) Glutamate, E       (8) Leucine, L
(9) Threonine, T        (10) Arginine, R      (11) Isoleucine, I     (12) Glutamine, Q
(13) Asparagine, N      (14) Histidine, H     (15) Lysine, K         (16) Cysteine, C
(17) Phenylalanine, F   (18) Tyrosine, Y      (19) Methionine, M     (20) Tryptophan, W



The primary code, containing codons in the single-nucleotide form (C, A, U, G), encoded
the four amino acids that appeared first [31]: Gly, Ala, Asp and Val (see Table II).
From the last column of Table I we conclude that the connection between digits and amino
acids is: 1 = Ala, 2 = Asp, 3 = Val, 4 = Gly. In the primary code these digits occupied the
first position in the 5-adic expansion, and at the next step, i.e. $C_5[4] \to C_5[4^2]$, they moved
to the second position, with the digits 1, 2, 3, 4 added in front of each of them.

It is worth noting that traces of some early peptides composed of the first four amino 
acids G, A, D, and V have been found recently [34] in the form of three motifs containing 
DGD submotif in some present-day proteins. This is in agreement with our conjecture on 
existence of the single nucleotide primary code at the very beginning of life. 

In $C_5[4^2]$ one has 16 dinucleotide codons which can code up to 16 amino acids. Addition
of the digit 4 in front of the already existing codons 1, 2, 3, 4 leaves their meaning unchanged, i.e.
41 = Ala, 42 = Asp, 43 = Val, and 44 = Gly. Adding the digits 3, 2, 1 in front of the primary
1, 2, 3, 4 codons one obtains 12 possibilities for coding some new amino acids. To decide
which amino acid was encoded by which of the 12 dinucleotide codons, we use as a criterion
their immutability in the trinucleotide coding on the $C_5[4^3]$ space. This criterion assumes
that amino acids encoded earlier have a more stable place in the genetic code table than those
encoded later. According to this criterion we decide in favor of the first row in each rectangle
of Table I, and the result is presented in Table III.
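
The selection rule just described, taking the first row of each rectangle of Table I, can be written out as a short Python sketch. Table I is encoded here as a 64-letter string of one-letter amino acid codes (`*` for Ter), and the names `MITO` and `dinucleotide_code` are assumptions made for illustration. The first row of a rectangle corresponds to the third digit 1, i.e. the third nucleotide C.

```python
from collections import Counter
from itertools import product

BASES = "CAUG"  # digits 1, 2, 3, 4
AA = (
    "PPPPHQHQLLLLRRRR"
    "TTTTNKNKIMIMS*S*"
    "SSSSY*Y*FLFLCWCW"
    "AAAADEDEVVVVGGGG"
)
MITO = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AA))

# Keep, for every pair of first two nucleotides, the codon whose third digit is 1 (C).
dinucleotide_code = {
    first + second: MITO[first + second + "C"]
    for second in BASES
    for first in BASES
}
usage = Counter(dinucleotide_code.values())

if __name__ == "__main__":
    for dinucleotide, amino_acid in dinucleotide_code.items():
        print(dinucleotide, amino_acid)
    print(len(usage), usage.most_common(1))
```

The resulting map contains 16 dinucleotides coding 15 distinct amino acids, with serine appearing twice and no stop signal.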

Transition from dinucleotide to trinucleotide codons occurred by attaching nucleotides
1, 2, 3, 4 at the third position, i.e. behind each dinucleotide. In this way one obtains a new
codon space C5 [4^3] = C5 [64], which is significantly enlarged and provides a pattern for
generating the known contemporary genetic codes. This codon space C5 [64] makes it possible
to realize at least three general properties of the modern code:

(i) encoding of more than 16 amino acids, 

(ii) diversity of codes, 

(iii) stability of the gene expression.

Let us give some relevant clarifications.

(i) For functioning of contemporary living organisms it is necessary to code at least 20
standard (Table II) and 2 non-standard amino acids (selenocysteine and pyrrolysine).
Probably these 22 amino acids are also sufficient building units for biosynthesis of all necessary
contemporary proteins. While C5 [4^2] is insufficient, the genomic space C5 [4^3] offers
approximately three codons per amino acid.

TABLE III The dinucleotide genetic code based on the p-adic genomic space C5 [4^2]. Note
that it encodes 15 amino acids, with serine encoded twice and no stop codon.

11 CC Pro     21 AC Thr     31 UC Ser     41 GC Ala
12 CA His     22 AA Asn     32 UA Tyr     42 GA Asp
13 CU Leu     23 AU Ile     33 UU Phe     43 GU Val
14 CG Arg     24 AG Ser     34 UG Cys     44 GG Gly

(ii) The standard (often called universal) code was established around 1966 and was
thought to be universal, i.e., common to all organisms. When the vertebrate mitochondrial
code was discovered in 1979, it gave rise to the belief that the code is not frozen and that
there are also other, mutually different codes. According to later evidence, there are at least
20 slightly different mitochondrial and nuclear codes (for a review, see
[7, 8, 32] and references therein). Different codes assign different meanings to some codons.
Thus, the standard code differs from Table I by the following changes:

• 232 (AUA): Met → Ile,

• 242 (AGA) and 244 (AGG): Ter → Arg,

• 342 (UGA): Trp → Ter.

Modifications in the 20 known codes are not homogeneously distributed over the 16 rectangles
of Table I. For instance, in all 20 codes the codons 41i (i = 1, 2, 3, 4) have the same meaning.

(iii) Each of the 20 codes is degenerate, and this degeneracy provides stability against
possible mutations. In other words, degeneracy helps to minimize codon errors.
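
The digit notation used in clarifications (i) and (ii) can be sketched in Python as follows.
The helper names `to_letters` and `to_digits` are ours. The `CHANGES` dictionary only
transcribes the three bullet points above, and 22 is the number of amino acids named in
item (i).

```python
# Digit <-> nucleotide assignment used in the paper: C = 1, A = 2, U = 3, G = 4.
NUCLEOTIDE = {"1": "C", "2": "A", "3": "U", "4": "G"}
DIGIT = {letter: digit for digit, letter in NUCLEOTIDE.items()}


def to_letters(digits: str) -> str:
    """Translate a digit string such as '232' into nucleotides ('AUA')."""
    return "".join(NUCLEOTIDE[d] for d in digits)


def to_digits(codon: str) -> str:
    """Translate a nucleotide string such as 'UGA' into digits ('342')."""
    return "".join(DIGIT[n] for n in codon)


# Codon -> (meaning in Table I, meaning in the standard code), from item (ii).
CHANGES = {
    "232": ("Met", "Ile"),
    "242": ("Ter", "Arg"),
    "244": ("Ter", "Arg"),
    "342": ("Trp", "Ter"),
}

# All elements of the trinucleotide codon space C5 [64].
ALL_CODONS = [a + b + c for a in "1234" for b in "1234" for c in "1234"]

for digits, (table_one, standard) in CHANGES.items():
    print(digits, to_letters(digits), table_one, "->", standard)

# Average number of codons available per amino acid when 22 are encoded.
print(len(ALL_CODONS), round(len(ALL_CODONS) / 22, 2))
```

The ratio 64/22 is about 2.9, which is the "approximately three codons" of item (i).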

Genetic codes based on single nucleotide and dinucleotide codons were mainly directed towards
coding amino acids with rather different properties. This may be the reason why amino acids
Glu and Gln are not coded in the dinucleotide code (Table II), since they are similar to Asp and
Asn, respectively. However, to become almost optimal, trinucleotide codes have taken into
account structural and functional similarities of amino acids.

We have presented here a hypothesis on the evolution of the genetic code which takes into
account possible codon evolution, from 1-nucleotide to 3-nucleotide codons, and the temporal
appearance of amino acids. This scenario may be extended to cell evolution, which probably
should be considered as a coevolution of all its main ingredients (for an early idea of the
coevolution, see [33]).



6 Concluding Remarks 

There are two important aspects of the genetic code which are related to: 
(i) multiplicity of codons which code the same amino acid,

(ii) assignment of codon multiplets to specific amino acids. 

The p-adic approach presented above gives a quite satisfactory description of aspect (i).
The ultrametric behavior of p-adic distances between elements of the C5 [64] codon space
differs radically from that of the usual distances. Quadruplets and doublets of codons have a
natural explanation within 5-adic and 2-adic nearness. Degeneracy of the genetic code in the
form of doublets, quadruplets and sextuplets is a direct consequence of p-adic ultrametricity
between codons. The p-adic C5 [64] codon space is our theoretical pattern for considering all
variants of the genetic code: some codes are direct representations of C5 [64] and the others
are its slight evolutionary modifications.
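
The 5-adic and 2-adic nearness within a quadruplet can be made concrete with a small Python
sketch. It rests on two assumptions that should be checked against the definitions given earlier
in the paper: a codon n0 n1 n2 is identified with the integer n0 + n1*5 + n2*25, and the
distance is the standard p-adic one, d_p(x, y) = |x - y|_p. The function names are illustrative.

```python
from fractions import Fraction


def p_adic_norm(x: int, p: int) -> Fraction:
    """Standard p-adic norm of an integer: p**(-k), where p**k exactly divides x."""
    if x == 0:
        return Fraction(0)
    x = abs(x)
    k = 0
    while x % p == 0:
        x //= p
        k += 1
    return Fraction(1, p**k)


def codon_value(digits: str) -> int:
    """Assumed convention: 'n0n1n2' stands for n0 + n1*5 + n2*25."""
    return sum(int(d) * 5**i for i, d in enumerate(digits))


def distance(a: str, b: str, p: int) -> Fraction:
    """p-adic distance between two codons given as digit strings."""
    return p_adic_norm(codon_value(a) - codon_value(b), p)


# The quadruplet 41i (GCC, GCA, GCU, GCG), which has the same meaning in all codes.
quadruplet = ["411", "412", "413", "414"]

for other in quadruplet[1:]:
    print("411", other, distance("411", other, 5), distance("411", other, 2))
```

Under these assumptions all four codons lie at 5-adic distance 1/25 from one another, while the
2-adic distance 1/2 singles out the pairs (411, 413) and (412, 414), i.e. codons ending in two
pyrimidines (C, U) or in two purines (A, G).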

(ii) Which amino acid corresponds to which multiplet of codons? An answer to this question
should be expected from connections between physicochemical properties of amino acids
and anticodons. Namely, the enzyme aminoacyl-tRNA synthetase links a specific tRNA
anticodon with the related amino acid. Thus there is no direct interaction between amino
acids and codons, as was believed in Gamow's time.

Note that there are in general 4! = 24 ways to assign digits 1, 2, 3, 4 to nucleotides C, A, U, G.
After an analysis of all 24 possibilities, we have taken C = 1, A = 2, U = T = 3, G = 4 as
a quite appropriate choice. In addition to various properties already presented in this paper,
the DNA base pairs exhibit the relation C + G = A + T = 5.
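
The count of 24 assignments and the base-pair relation can be reproduced with a brief Python
sketch. It checks only the single relation C + G = A + U = 5 (with U standing for T); the
other criteria used in the analysis of all 24 possibilities are not modelled here, and the
variable names are ours.

```python
from itertools import permutations

# Every way of assigning the digits 1, 2, 3, 4 to the nucleotides C, A, U, G.
assignments = [dict(zip("CAUG", digits)) for digits in permutations((1, 2, 3, 4))]

# Assignments in which complementary bases sum to 5.
complementary = [
    a for a in assignments
    if a["C"] + a["G"] == 5 and a["A"] + a["U"] == 5
]

# The choice adopted in the paper.
chosen = {"C": 1, "A": 2, "U": 3, "G": 4}

print(len(assignments), len(complementary), chosen in complementary)
```

The relation alone therefore narrows the 24 assignments to 8, among which the adopted choice
is one.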

One can express many of the above considerations on p-adic information theory in
linguistic terms and investigate possible linguistic applications.

In this paper we have employed p-adic distances to measure nearness between codons,
and this nearness has been used to describe degeneracy of the genetic code. It is worth noting
that in other contexts p-adic distances can be given quite different interpretations. For
example, the 3-adic distance between cytosine and guanine is d3(1, 4) = 1/3, and that between
adenine and thymine is d3(2, 3) = 1. It seems natural to relate this 3-adic distance to hydrogen
bonds between complements in the DNA double helix: the smaller the distance, the stronger the
hydrogen bond. Recall that C-G and A-T are bonded by 3 and 2 hydrogen bonds, respectively.
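
For reference, the two values follow from the standard definition of the p-adic distance,
d_p(x, y) = |x - y|_p with |x|_p = p^(-k) when p^k is the highest power of p dividing x. We
assume this is the definition used earlier in the paper.

```latex
d_3(1,4) = |1-4|_3 = |-3|_3 = 3^{-1} = \tfrac{1}{3},
\qquad
d_3(2,3) = |2-3|_3 = |-1|_3 = 3^{0} = 1 .
```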

The translation of codon sequences into proteins is a highly information-processing
phenomenon. The p-adic information modelling presented in this paper offers a new approach to
systematic investigation of ultrametric aspects of DNA and RNA sequences, the genetic code
and the world of proteins. It can be embedded in computer programs to explore the p-adic
side of the genome and related subjects.

The above considerations and obtained results may be regarded as contributions towards 
foundations of (i) p-adic theory of information and (ii) p-adic theory of the genetic code. 

(i) Contributions to p-adic theory of information contain: 

• formulation of p-adic genomic space (whose examples are spaces of nucleotides,
dinucleotides and trinucleotides),

• relation between building blocks of information spaces and some prime numbers; 

(ii) Contributions to p-adic theory of the genetic code include: 

• description of codon quadruplets and doublets by 5-adic and 2-adic distances, 

• observation of a symmetry between quadruplets as well as between doublets in our table
of codons,
• formulation of degeneracy principle,

• formulation of hypothesis on codon evolution. 

Many problems concerning the above p-adic approach to genomics remain to be explored
in the future. Among the most attractive and important themes are:

• elaboration of the p-adic theory of information towards genomics and proteomics, 

• evolution of the genome and the genetic code, 

• structure and function of non-coding DNA, 

• creation of the corresponding computer programs. 